Stemming for Kurdish Information Retrieval

نویسندگان

  • Shahin Salavati
  • Kyumars Sheykh Esmaili
  • Fardin Akhlaghian
چکیده

Resource scarcity along with diversity –in both dialect and script– are the two primary challenges in Kurdish language processing. In this paper we aim at addressing these two problems by building stemmers for the two main dialects of the Kurdish language (i.e. Sorani and Kurmanji) and investigate their effectiveness on Kurdish Information Retrieval. More specifically, we build Jedar, the first rule-based stemmer for both Sorani and Kurmanji. We also implement GRAS –as a state-of-the-art statistical stemming technique– and apply it to both of the Kurdish dialects. We then conduct a comprehensive experimental study to compare the effectiveness of these stemmers. Our experimental results show that stemming can significantly –up to %35– improve the retrieval performance on Kurdish documents. Furthermore, they indicate that the gains from the rule-based and the statistical approaches are comparable.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

A Towards Kurdish Information Retrieval

The Kurdish language is an Indo-European language spoken in Kurdistan, a large geographical region in the Middle East. Despite having a large number of speakers, Kurdish is among the less-resourced languages and has not seen much attention from the IR and NLP research communities. This paper reports on the outcomes of a project aimed at providing essential resources for processing Kurdish texts...

متن کامل

بررسی تأثیرات ریشه‌یابی در بازیابی اطلاعات در زبان فارسی

Using the language-specific behavior in information retrieval systems can improve the quality of the retrieved results significantly. Part of the word that remains after removing its affixes is called stem. Stemming process can be used for improving the relevancy of the results in information retrieval system. Different morphological variants of words (plural, past tense…) will be mapped into t...

متن کامل

Towards Building KurdNet, the Kurdish WordNet

In this paper we highlight the main challenges in building a lexical database for Kurdish, a resource-scarce and diverse language. We also report on our effort in building the first prototype of KurdNet – the Kurdish WordNet– along with a preliminary evaluation of its impact on Kurdish information retrieval.

متن کامل

Improving Precision in Information Retrieval for Swedish using Stemming

We will in this paper present an evaluation of how much stemming improves precision in information retrieval for Swedish texts. To perform this, we built an information retrieval tool with optional stemming and created a tagged corpus in Swedish. We know that stemming in information retrieval for English, Dutch and Slovenian gives better precision the more inflecting the language is, but precis...

متن کامل

Effective Stemming for Arabic Information Retrieval

Arabic has a very rich and complex morphology. Its appropriate morphological processing is very important for Information Retrieval (IR). In this paper, we propose a new stemming technique that tries to determine the stem of a word representing the semantic core of this word according to Arabic morphology. This method is compared to a commonly used light stemming technique which truncates a wor...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2013